---
id: storages
title: Storages
description: How to work with storages in Crawlee, how to manage requests and how to store and retrieve scraping results.
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import OpeningExample from '!!raw-loader!roa-loader!./code_examples/storages/opening.py';

import RqBasicExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_basic_example.py';
import RqWithCrawlerExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_example.py';
import RqWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_examples/storages/rq_with_crawler_explicit_example.py';
import RqHelperAddRequestsExample from '!!raw-loader!roa-loader!./code_examples/storages/helper_add_requests_example.py';
import RqHelperEnqueueLinksExample from '!!raw-loader!roa-loader!./code_examples/storages/helper_enqueue_links_example.py';

import DatasetBasicExample from '!!raw-loader!roa-loader!./code_examples/storages/dataset_basic_example.py';
import DatasetWithCrawlerExample from '!!raw-loader!roa-loader!./code_examples/storages/dataset_with_crawler_example.py';
import DatasetWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_examples/storages/dataset_with_crawler_explicit_example.py';

import KvsBasicExample from '!!raw-loader!roa-loader!./code_examples/storages/kvs_basic_example.py';
import KvsWithCrawlerExample from '!!raw-loader!roa-loader!./code_examples/storages/kvs_with_crawler_example.py';
import KvsWithCrawlerExplicitExample from '!!raw-loader!roa-loader!./code_examples/storages/kvs_with_crawler_explicit_example.py';

import CleaningDoNotPurgeExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_do_not_purge_example.py';
import CleaningPurgeExplicitlyExample from '!!raw-loader!roa-loader!./code_examples/storages/cleaning_purge_explicitly_example.py';

Crawlee offers several storage types for managing and persisting your crawling data. Request-oriented storages, such as the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, help you store and deduplicate URLs, while result-oriented storages, like <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, focus on storing and retrieving scraping results. This guide explains when to use each type, how to interact with them, and how to control their lifecycle.

## Overview

Crawlee's storage system consists of two main layers:
- **Storages** (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>): High-level interfaces for interacting with different storage types.
- **Storage clients** (<ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>, <ApiLink to="class/FileSystemStorageClient">`FileSystemStorageClient`</ApiLink>, etc.): Backend implementations that handle the actual data persistence and management.

For more information about storage clients and their configuration, see the [Storage clients guide](./storage-clients).

```mermaid
---
config:
    class:
        hideEmptyMembersBox: true
---

classDiagram

%% ========================
%% Abstract classes
%% ========================

class Storage {
    <<abstract>>
}

%% ========================
%% Specific classes
%% ========================

class Dataset

class KeyValueStore

class RequestQueue

%% ========================
%% Inheritance arrows
%% ========================

Storage --|> Dataset
Storage --|> KeyValueStore
Storage --|> RequestQueue
```

### Named and unnamed storages

Crawlee supports two types of storages:

- **Named storages**: Persistent storages with a specific name that persist across runs. These are useful when you want to share data between different crawler runs or access the same storage from multiple places.
- **Unnamed storages**: Temporary storages identified by an alias that are scoped to a single run. These are automatically purged at the start of each run (when `purge_on_start` is enabled, which is the default).

### Default storage

Each storage type (<ApiLink to="class/Dataset">`Dataset`</ApiLink>, <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>) has a default instance that can be accessed without specifying `id`, `name` or `alias`. Default unnamed storage is accessed by calling storage's `open` method without parameters. This is the most common way to use storages in simple crawlers. The special alias `"default"` is equivalent to calling `open` without parameters

<RunnableCodeBlock className="language-python" language="python">
    {OpeningExample}
</RunnableCodeBlock>

## Request queue

The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages). Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run.

The following code demonstrates the usage of the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:

<Tabs groupId="request_queue">
    <TabItem value="request_queue_basic_example" label="Basic usage" default>
        <RunnableCodeBlock className="language-python" language="python">
            {RqBasicExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="request_queue_with_crawler" label="Usage with Crawler">
        <RunnableCodeBlock className="language-python" language="python">
            {RqWithCrawlerExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="request_queue_with_crawler_explicit" label="Explicit usage with Crawler" default>
        <RunnableCodeBlock className="language-python" language="python">
            {RqWithCrawlerExplicitExample}
        </RunnableCodeBlock>
    </TabItem>
</Tabs>

### Request-related helpers

Crawlee provides helper functions to simplify interactions with the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:

- The <ApiLink to="class/AddRequestsFunction">`add_requests`</ApiLink> function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs you want to be added to the request storage. If you need to specify further details of the request, such as a `label` or `user_data`, you have to pass instances of the <ApiLink to="class/Request">`Request`</ApiLink> class to the helper.
- The <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> function is designed to discover new URLs in the current page and add them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See [Crawl website with relative links](../examples/crawl-website-with-relative-links) example for more details.

<Tabs groupId="request_helpers">
    <TabItem value="request_helper_add_requests" label="Add requests" default>
        <RunnableCodeBlock className="language-python" language="python">
            {RqHelperAddRequestsExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="request_helper_enqueue_links" label="Enqueue links">
        <RunnableCodeBlock className="language-python" language="python">
            {RqHelperEnqueueLinksExample}
        </RunnableCodeBlock>
    </TabItem>
</Tabs>

### Request manager

The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> implements the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> interface, offering a unified API for interacting with various request storage types. This provides a unified way to interact with different request storage types.

If you need custom functionality, you can create your own request storage by subclassing the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> class and implementing its required methods.

For a detailed explanation of the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> and other related components, refer to the [Request loaders guide](https://crawlee.dev/python/docs/guides/request-loaders).

## Dataset

The <ApiLink to="class/Dataset">`Dataset`</ApiLink> is designed for storing structured data, where each entry has a consistent set of attributes, such as products in an online store or real estate listings. Think of a <ApiLink to="class/Dataset">`Dataset`</ApiLink> as a table: each entry corresponds to a row, with attributes represented as columns. Datasets are append-only, allowing you to add new records but not modify or delete existing ones. Every Crawlee project run is associated with a default dataset, typically used to store results specific to that crawler execution. However, using this dataset is optional.

The following code demonstrates basic operations of the dataset:

<Tabs groupId="dataset_storage">
    <TabItem value="dataset_basic_example" label="Basic usage" default>
        <RunnableCodeBlock className="language-python" language="python">
            {DatasetBasicExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="dataset_with_crawler" label="Usage with Crawler">
        <RunnableCodeBlock className="language-python" language="python">
            {DatasetWithCrawlerExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="dataset_with_crawler_explicit" label="Explicit usage with Crawler" default>
        <RunnableCodeBlock className="language-python" language="python">
            {DatasetWithCrawlerExplicitExample}
        </RunnableCodeBlock>
    </TabItem>
</Tabs>

### Dataset-related helpers

Crawlee provides the following helper function to simplify interactions with the <ApiLink to="class/Dataset">`Dataset`</ApiLink>:

- The <ApiLink to="class/PushDataFunction">`push_data`</ApiLink> function allows you to manually add data to the dataset. You can optionally specify the dataset ID or its name.

## Key-value store

The <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink> is designed to save and retrieve data records or files efficiently. Each record is uniquely identified by a key and is associated with a specific MIME type, making the <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink> ideal for tasks like saving web page screenshots, PDFs, or tracking the state of crawlers.

The following code demonstrates the usage of the <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>:

<Tabs groupId="kv_storage">
    <TabItem value="kvs_basic_example" label="Basic usage" default>
        <RunnableCodeBlock className="language-python" language="python">
            {KvsBasicExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="kvs_with_crawler" label="Usage with Crawler">
        <RunnableCodeBlock className="language-python" language="python">
            {KvsWithCrawlerExample}
        </RunnableCodeBlock>
    </TabItem>
    <TabItem value="kvs_with_crawler_explicit" label="Explicit usage with Crawler" default>
        <RunnableCodeBlock className="language-python" language="python">
            {KvsWithCrawlerExplicitExample}
        </RunnableCodeBlock>
    </TabItem>
</Tabs>

To see a real-world example of how to get the input from the key-value store, see the [Screenshots](https://crawlee.dev/python/docs/examples/capture-screenshots-using-playwright) example.

### Key-value store-related helpers

Crawlee provides the following helper function to simplify interactions with the <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>:

- The <ApiLink to="class/GetKeyValueStoreFunction">`get_key_value_store`</ApiLink> function retrieves the key-value store for the current crawler run. If the KVS does not exist, it will be created. You can also specify the KVS's ID or its name.

## Cleaning up the storages

By default, Crawlee cleans up all unnamed storages (including the default one) at the start of each run, so every crawl begins with a clean state. This behavior is controlled by <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink> (default: True). In contrast, named storages are never purged automatically and persist across runs. The exact behavior may vary depending on the storage client implementation.

### When purging happens

The cleanup occurs as soon as a storage is accessed:
- When opening a storage explicitly (e.g., <ApiLink to="class/RequestQueue#open">`RequestQueue.open`</ApiLink>, <ApiLink to="class/Dataset#open">`Dataset.open`</ApiLink>, <ApiLink to="class/KeyValueStore#open">`KeyValueStore.open`</ApiLink>).
- When using helper functions that implicitly open storages (e.g., <ApiLink to="class/PushDataFunction">`push_data`</ApiLink>).
- Automatically when <ApiLink to="class/BasicCrawler#run">`BasicCrawler.run`</ApiLink> is invoked.

### Disabling automatic purging

To disable automatic purging, set `purge_on_start=False` in your configuration:

<RunnableCodeBlock className="language-python" language="python">
    {CleaningDoNotPurgeExample}
</RunnableCodeBlock>

### Manual purging

Purge on start behavior just triggers the storage's `purge` method, which removes all data from the storage. If you want to purge the storage manually, you can do so by calling the `purge` method on the storage instance. Or if you want to delete the storage completely, you can call the `drop` method on the storage instance, which will remove the storage, including metadata and all its data.

<RunnableCodeBlock className="language-python" language="python">
    {CleaningPurgeExplicitlyExample}
</RunnableCodeBlock>

Note that purging behavior may vary between storage client implementations. For more details on storage configuration and client implementations, see the [Storage clients guide](./storage-clients).

## Conclusion

This guide introduced you to the different storage types available in Crawlee and how to interact with them. You learned about the distinction between named storages (persistent across runs) and unnamed storages with aliases (temporary and purged on start). You discovered how to manage requests using the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> and store and retrieve scraping results using the <ApiLink to="class/Dataset">`Dataset`</ApiLink> and <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>. You also learned how to use helper functions to simplify interactions with these storages and how to control storage cleanup behavior.

If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
